What the data is about:
The data is from the paper: Modeling wine preferences by data mining from physicochemical properties, by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. It describes the quality of 4898 white wines, each rated by at least 3 wine experts on a scale from 0 (the lowest score) to 10 (the highest score). For each wine, 11 chemical features are collected.
Variables used in the data and their units
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
First several rows of the data
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
Dimensionality of the data
## [1] 4898 13
Structure
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Summary
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
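The four outputs above (first rows, dimensionality, structure, summary) look like the results of `head()`, `dim()`, `str()` and `summary()`. The following is a minimal sketch of that workflow, not the original script: the source does not give the file name, so the sketch writes only the six rows printed above to a temporary CSV file and reads them back. The names `wwq_demo`, `csv_path` and `wwq` are assumptions made for illustration.

```r
# First six rows exactly as printed in the report above (not the full data set).
wwq_demo <- data.frame(
  X                    = 1:6,
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058, 0.050),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97),
  density              = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40, 0.44),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality              = rep(6L, 6)
)

# Hypothetical stand-in for the real CSV file, whose name is not given.
csv_path <- tempfile(fileext = ".csv")
write.csv(wwq_demo, csv_path, row.names = FALSE)

wwq <- read.csv(csv_path)
head(wwq)     # first rows
dim(wwq)      # rows and columns (6 and 13 for this excerpt)
str(wwq)      # column types
summary(wwq)  # per-column five-number summary and mean
```

On the full file, the same four calls should give the 4898 x 13 result shown above.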
Now, let’s analyze the data in detail.
As shown above, there are 4898 wines in the data set, and each wine is labeled with a quality score (on a scale from 0 to 10) and characterized by 11 features. First, I want to plot histograms of all 11 features, regardless of quality level, to get a general understanding of the chemical content of white wines.
For the output variable quality, it is not surprising that only 7 of the possible scores actually occur: 3, 4, 5, 6, 7, 8 and 9. The mode is 6, and wines scored 3 or 9 are very rare. If we look at the table for the quality
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
we can see that the samples are not balanced across the quality groups: there are only 20 wines scored 3 and only 5 wines scored 9. Thus, I want to regroup the data into three quality levels: low quality (scores 3, 4, 5), mid quality (score 6) and high quality (scores 7, 8, 9), and call this new feature ‘label1’.
##
## low quality mid quality high quality
## 1640 2198 1060
This time, the number of samples in each category has the same order of magnitude. This new feature will be used in the Bivariate and multivariate sections.
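The regrouping can be sketched with `cut()`. This is one possible way to do it, not necessarily the code used for the report. To stay self-contained, the sketch rebuilds a quality vector from the counts in the quality table above. The name `quality` is an illustrative stand-in for the data frame column.

```r
# Rebuild a quality vector from the counts reported in the table above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are (2,5], (5,6] and (6,9], so 3-5 -> low, 6 -> mid, 7-9 -> high.
label1 <- cut(quality,
              breaks = c(2, 5, 6, 9),
              labels = c("low quality", "mid quality", "high quality"))

table(label1)
```

The resulting counts should match the three-level table above (1640, 2198 and 1060).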
Now, let’s look at some interesting histograms in more detail. The first thing I noticed is a huge peak at the low end of residual.sugar, with the whole histogram right skewed. To get a better view, I replot the residual.sugar histogram on a log scale:
We can see a very obvious bimodal structure, which indicates that the sugar left after fermentation is likely to be either rather low or rather high, but unlikely to be at a moderate level. I tried plotting other features on a log scale, but no interesting properties showed up.
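To illustrate why a log scale can reveal two modes that a linear scale hides, here is a small sketch on simulated data. The values are not the wine data. They are an assumed mixture of two log-normal groups centered near 2 and 10, chosen only to mimic the shape described. Base R graphics are used to keep the sketch free of extra packages.

```r
set.seed(42)

# Simulated stand-in for residual.sugar (assumption, not the real column):
# a low-sugar group near 2 and a high-sugar group near 10.
sugar_sim <- c(rlnorm(600, meanlog = log(2),  sdlog = 0.3),
               rlnorm(400, meanlog = log(10), sdlog = 0.4))

# Side-by-side histograms: linear scale vs. log10 scale.
par(mfrow = c(1, 2))
hist(sugar_sim, breaks = 30,
     main = "Linear scale", xlab = "residual.sugar (simulated)")
hist(log10(sugar_sim), breaks = 30,
     main = "Log10 scale", xlab = "log10(residual.sugar) (simulated)")
```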
To investigate further the structures of the features, I put the box plot together with the histogram for each feature.
The distribution of the fixed.acidity is pretty symmetric, with only a small right skewness. It has some outliers on both tails, and the largest one is on the right.
The distribution of the volatile.acidity is heavily right skewed, and all the outliers are on the right tail. Replotting in log scale, we have
After the log transformation, the distribution of volatile.acidity is more symmetric. If the final model needs to include volatile.acidity, the log transformation should be applied.
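"More symmetric" can be quantified with the sample skewness. The sketch below uses simulated right-skewed values rather than the real volatile.acidity column, and the helper `skewness()` is defined here for illustration (it is not a base R function).

```r
set.seed(1)

# Simulated right-skewed stand-in for volatile.acidity (assumption).
vol_sim <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

# Sample skewness: mean of the cubed standardized values.
# Illustrative helper, not part of base R.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(vol_sim)         # clearly positive (right skewed)
skew_log <- skewness(log10(vol_sim))  # close to zero after the transformation

c(raw = skew_raw, log10 = skew_log)
```

A value near zero indicates a roughly symmetric distribution, so comparing the two numbers gives a simple check of whether the transformation helped.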
The distribution of citric.acid is roughly symmetric with some right skewness. I tried the log transformation, but it does not improve the symmetry much. I also noticed an interesting vertical line of dots around 0.5, which I guess might come from wines that share the same source of fruit.
As I mentioned before, this distribution of residual.sugar is heavily skewed, and let’s look at it in log scale.
This is the same bimodal structure. The boxplot shows an interesting but strange pattern: at the left end, the dots form vertical lines with clear separation between them, but at the right end, the dots are spread out randomly.
I made the same plots for the other features as well, but will not show them all. The logic is the same: if a distribution is heavily skewed, I try a log transformation to make it more symmetric.
I just want to show the result for alcohol, since it is the most important feature in the analysis that follows.
All right, an interesting pattern shows up here. The alcohol values seem to be rather discrete, which might come from legal restrictions: the alcohol content has to be stated very accurately on the wine label.
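One simple way to probe the discreteness is to test whether the values sit on a 0.1 grid. The sketch below uses only the first ten alcohol values printed in the structure output above. Whether the full column behaves the same way would have to be checked on the actual data.

```r
# First ten alcohol values, copied from the str() output above.
alcohol_first10 <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# A value lies on the 0.1 grid if multiplying by 10 gives (numerically) an integer.
on_tenth_grid <- abs(alcohol_first10 * 10 - round(alcohol_first10 * 10)) < 1e-8

all(on_tenth_grid)               # TRUE if every value has at most one decimal
length(unique(alcohol_first10))  # number of distinct values in this excerpt
```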
I use the white wine quality data. There are 4898 different wines in total, and their quality is rated from 0 (worst) to 10 (best). In the data, the quality scores that occur are 3, 4, 5, 6, 7, 8 and 9. With a peak at score 6, the number of wines decreases monotonically towards both lower and higher scores.
For each wine, 11 chemical features are measured and reported as numbers with specific units. Some of the features have quite skewed distributions, and residual.sugar has a very obvious bimodal structure when plotted on a log scale. In addition, the alcohol level has a rather broad distribution.
The main feature of interest is the wine quality, which is rated by at least three experts. The question I am asking here is whether and how the white wine quality is predicted by the chemical features. In the original dataset, there are 7 levels of quality, but for some levels, the sample size is too small. To get better statistics, I regrouped the data into three categories: low quality, mid quality and high quality.
This question will be addressed further in the Bivariate section, where I discuss the relation between the wine quality and the chemical features. For now, I suspect that acidity levels, sulfur levels and alcohol levels will help my investigation. For example, a high sulfur level might affect the taste and smell of the wines.
Yes, I did. Since the sample sizes are small for both the lowest and the highest scores, I combined wines with scores 3, 4 and 5 into low quality wines, and wines with scores 7, 8 and 9 into high quality wines. Wines with score 6 are labeled as mid quality wines. I created a new factor variable called ‘label1’ to label wines as ‘low quality’, ‘mid quality’ and ‘high quality’ based on their quality scores.
Yes, I performed a log transformation on the residual.sugar level. From the histogram of residual.sugar, I noticed that there is a huge peak for very low residual.sugar level, and the histogram is skewed. After the log transformation, a clear bimodal structure appears.
The boxplots of residual.sugar show a gradient from left to right, where the pattern changes from order to randomness. However, the boxplots of alcohol show a consistently ordered pattern from left to right.
First, I am very interested in seeing what the differences are between low-quality and high-quality wines, and for the next plots, I want to do the following comparison: wines with score 3 and 4 (low quality) vs wines with score 8 and 9 (high quality), and for each group, there are around 180 samples.
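The "around 180 samples" figure follows directly from the quality table shown earlier. A short check, with the counts copied from that table (the name `quality_counts` is illustrative):

```r
# Counts per quality score, copied from the quality table in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

n_low  <- sum(quality_counts[c("3", "4")])  # wines scored 3 or 4
n_high <- sum(quality_counts[c("8", "9")])  # wines scored 8 or 9

c(low = n_low, high = n_high)
```

This gives 183 wines in the low group and 180 in the high group, so the two extremes are of comparable size.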
It is interesting to note that some of the chemical features are very similar between high-quality and low-quality wines. However, for volatile.acidity, citric.acid, free.sulfur.dioxide, pH, density and alcohol percentage, the distributions differ between the two groups. The most interesting observation is that high-quality wines have a higher alcohol percentage than low-quality wines. Consistent with this, since alcohol is less dense than water, high-quality wines have a lower density than low-quality wines. This negative correlation between density and alcohol will be discussed further in the bivariate section.
Now, let’s repeat the same high-quality and low-quality analysis, but expand the sample size to include 5 as low-quality, and 7 as high-quality. In this way, high quality wines include quality scores 7, 8 and 9; low quality wines include quality scores 3, 4, and 5.
This is very similar to the previous analysis. The alcohol percentage and density are still the most obvious indicators of wine quality in the current analysis, especially the skewness of the alcohol distribution: high quality wines are left skewed, and low quality wines are right skewed.
Now, I want to use the new feature I created, ‘label1’, where the wines are grouped into three categories based on their quality: low quality, mid quality, and high quality.
The above figure shows the reassuring result that the curves for mid quality wines lie between the curves for high quality and low quality wines. The curves are a little different from the ones in the previous figure. This is understandable, since the range on the x axis has changed and we are still using 20 bins.
To get another view of the data, I scale the x axis to log10 and get the following figure.
One of the figures catches my eye: residual.sugar. For both low quality and mid quality wines, the densities are bimodal, meaning the sugar left after fermentation is either low (around 2 g/dm^3) or high (around 10 g/dm^3). However, for high quality wines, the density of residual.sugar is a little flatter. Another interesting figure is the alcohol one. On the log10 scale, high quality is left skewed, low quality is right skewed, and mid quality is not skewed.
I want to have a look at the box plots of wine quality against each of the features.
It can be seen that some of the features are really good indicators of wine quality: residual.sugar, chlorides, total.sulfur.dioxide, density, pH, and alcohol. In the multivariate section, I will explore further how quality depends on more than one feature. Now, let’s look at some relations between different features.
We can see that, apart from one strange outlier, there is a clear negative linear relation between density and alcohol level. If we remove that outlier, the negative linearity is even clearer.
This result is not surprising, since alcohol has a lower density than water, and more alcohol will make the density lower.
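A negative relation like this can be checked with a correlation coefficient and a fitted line. The sketch below uses only the six rows printed at the top of this report, two of which are duplicates, so it illustrates the mechanics only and is no substitute for the full-data plot. The name `first6` is an assumption for illustration.

```r
# Alcohol and density for the first six wines, copied from the head() output.
first6 <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951)
)

# Pearson correlation and a simple linear fit of density on alcohol.
r        <- cor(first6$alcohol, first6$density)
fit_line <- lm(density ~ alcohol, data = first6)
slope    <- unname(coef(fit_line)["alcohol"])

# Scatter plot with the fitted line.
plot(first6$alcohol, first6$density,
     xlab = "alcohol (% by volume)", ylab = "density (g / cm^3)")
abline(fit_line)
```

Both `r` and `slope` come out negative for this excerpt, in line with the direction described above.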
From the figure above, I noticed that fixed.acidity has a clear negative linear relation with pH, but this type of relation is not obvious for volatile.acidity. For citric.acid, there is a decreasing trend, but it is not very pronounced.
We can see that total.sulfur.dioxide and free.sulfur.dioxide have a positive linear relation, which is expected because free sulfur dioxide is one component of total sulfur dioxide.
The most interesting feature in my dataset is that wines with different qualities have very different alcohol level distributions. Low quality wines tend to have a lower alcohol level, and the histogram is right skewed, while high quality wines tend to have a higher alcohol level, and the histogram is left skewed. Mid quality wines sit in the middle, and the histogram does not have an obvious skewness.
Besides, the wine quality also depends on several other features. The wine quality increases when pH increases, but decreases when residual.sugar, chlorides, total.sulfur.dioxide, or density increases. This is also shown in the frequency polygon plots. In fact, the frequency polygons of many features show a very clear shift as the wine quality changes. Taking total.sulfur.dioxide as an example, as the wine quality decreases, the polygons shift from low to high levels of sulfur dioxide. This is plausible, since a higher level of sulfur dioxide may make the wines taste worse.
Although the median shifts for different quality levels, the general shapes of the distributions across different quality levels are very similar, such as the skewness.
Yes. The density of the wine has a negative linear relation with the alcohol level, and the total.sulfur.dioxide has a positive linear relation with the free.sulfur.dioxide.
The strongest relationship is between the wine quality and the alcohol level. Besides this, the density, pH level, residual.sugar, chlorides, and total.sulfur.dioxide are also correlated strongly with the wine quality. These features will be used in the predictive model.
This plot is the same as the one in the last section, but colored based on their quality level. A clear pattern is shown, where low quality wines accumulate at the left tail, mid quality wines at middle, and high quality wines at the right tail.
Let’s look at more figures of this type.
The pattern in the above figures is not clear, so let’s switch to a log scale.
The pattern is still not very clear, although both pH and fixed.acidity show shifts of the median when the quality level changes. However, I may still use these two features in the final classification model.
This shows some vague clusters that separate low quality wines from the other two groups. I will try to use these two features in the final model.
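library(rpart)

# Classification tree for the three-level quality label.
# Note: the outer I() makes "+" arithmetic, so rpart receives a single
# predictor (the sum of the four terms), as the printcp() output shows.
fit <- rpart(I(label1) ~ I(alcohol + density
                           + log10(total.sulfur.dioxide)
                           + log10(free.sulfur.dioxide)),
             data = wwq_lmh,
             method = 'class')
printcp(fit)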
##
## Classification tree:
## rpart(formula = I(label1) ~ I(alcohol + density + log10(total.sulfur.dioxide) +
## log10(free.sulfur.dioxide)), data = wwq_lmh, method = "class")
##
## Variables actually used in tree construction:
## [1] I(alcohol + density + log10(total.sulfur.dioxide) + log10(free.sulfur.dioxide))
##
## Root node error: 2700/4898 = 0.55125
##
## n= 4898
##
## CP nsplit rel error xerror xstd
## 1 0.086667 0 1.00000 1.00000 0.012892
## 2 0.042222 1 0.91333 0.92000 0.012959
## 3 0.035556 2 0.87111 0.84778 0.012933
## 4 0.010000 3 0.83556 0.83778 0.012922
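The "Variables actually used in tree construction" line lists a single variable because wrapping the right-hand side in `I()` turns `+` into ordinary addition. The sketch below inspects the formula terms with base R only. If four separate predictors were intended (an assumption, since the source does not say), the formula would be written without the outer `I()`, and the fitted tree and error rates would then differ from the ones reported here.

```r
# Formula as used in the report: one term, the arithmetic sum of four quantities.
f_summed <- label1 ~ I(alcohol + density +
                         log10(total.sulfur.dioxide) +
                         log10(free.sulfur.dioxide))

# Alternative (assumed intent): four separate predictors.
f_separate <- label1 ~ alcohol + density +
  log10(total.sulfur.dioxide) + log10(free.sulfur.dioxide)

# labels(terms(...)) lists the model terms a fitting function would see.
labels(terms(f_summed))    # a single term
labels(terms(f_separate))  # four terms
```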
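From the last row of this table, we can see that the training error is about 0.461 (0.55125 * 0.83556), and the cross-validation error is about 0.462 (0.55125 * 0.83778). The cross-validation value changes slightly from run to run, because the folds are assigned at random.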
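The conversion from the relative errors in the `printcp()` table to absolute error rates is a single multiplication by the root node error. A worked version using the numbers printed above (variable names are illustrative):

```r
# Values copied from the printcp() output above.
root_node_error <- 2700 / 4898  # misclassification rate before any split
rel_error       <- 0.83556      # last row, "rel error" column
xerror          <- 0.83778      # last row, "xerror" column

# Absolute error = root node error * relative error.
train_error <- root_node_error * rel_error
cv_error    <- root_node_error * xerror

round(c(train = train_error, cv = cv_error), 3)
```

This gives roughly 0.461 for the training error and 0.462 for the cross-validation error.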
I tried three pairs of features. Among them, alcohol and density strengthened each other and showed clear clustering of wines based on their quality. Besides, on a log scale, total.sulfur.dioxide and free.sulfur.dioxide could also cluster the wines, although not as well as the first pair. However, fixed.acidity and pH did not show any structure.
I really did not expect the alcohol level to have such a strong relation with wine quality, and the clear clustering of wines based on their alcohol level and density is surprising to me. However, I am a bit confused about why there are no other strong indicators of wine quality. Probably, I need to put more effort into this dataset in order to get more relations out of it.
Yes, I built a tree model using rpart. Both the training error and the cross-validation error are around 0.46, which is better than a pure random guess (0.67, since I used 3 categories instead of 7) and also better than the root node error of 0.55, which corresponds to always predicting the most common class. However, this model is pretty preliminary, and I don’t think I put enough time into it. I need to try more algorithms.
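Two baselines are useful for judging an error rate of about 0.46: guessing uniformly at random among three classes, and always predicting the most common class. Both can be computed from the three-level counts shown earlier (variable names are illustrative):

```r
# Counts per label1 level, copied from the regrouped table in this report.
label_counts <- c("low quality" = 1640, "mid quality" = 2198, "high quality" = 1060)

# Uniform random guessing among k classes is wrong with probability 1 - 1/k.
random_guess_error <- 1 - 1 / length(label_counts)

# Always predicting the largest class is wrong for every other wine.
majority_error <- 1 - max(label_counts) / sum(label_counts)

round(c(random = random_guess_error, majority = majority_error), 3)
```

The majority-class baseline equals the root node error in the `printcp()` output (about 0.551), so the tree's gain over this stricter baseline is roughly 9 percentage points.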
This figure shows distributions of the alcohol for the three wine qualities, which is straightforward and informative about the overall relations between alcohol and wine quality. The distribution has a right skewness for low quality wine, and left skewness for high quality wine.
This is the strongest relation that I can get out of the data (one outlier removed), and this negative linearity is understandable, since alcohol has a lower density than water. These two features, alcohol and density, are strong indicators of wine quality.
This is the same plot as the previous one, but with colors indicating the different wine qualities. We can see that there are three clusters: red dots accumulate at the left, representing low quality wines, green dots sit in the center, representing mid quality wines, and blue dots lie at the right, representing high quality wines.
The dataset contains ratings and 11 chemical features of 4898 different white wines. My question is whether and how the wine quality depends on its chemical features. I explored relations between the wine quality and the features, as well as relations among the features. Finally, I built a predictive tree model to do the classification.
At the beginning, I saw that the distributions of many features vary with wine quality. For example, the distribution of residual.sugar is bimodal for both mid and low quality wines, but rather flat for high quality wines. However, when I was doing the multivariate analysis, I found that this difference does not help distinguish wines. What is surprising to me is that the alcohol level is a very strong indicator of wine quality: good wines tend to have higher alcohol levels. I am not clear what the reason is; perhaps wines with more alcohol generally taste better. Besides the alcohol level, the amount of sulfur dioxide also affects the wine quality. The more sulfur dioxide, the worse the wine quality, which makes sense since sulfur dioxide does not smell good. I built a tree model to try to predict the wine quality using alcohol level, density, and sulfur dioxide. The model achieves an error rate of 0.46, which is better than guessing randomly (in this case, the error rate would be 0.67).
Of course, the analysis and modeling so far are still preliminary, and further investigation of this dataset is needed. First, I want to know more about the background of the data, such as the general experimental procedures and how the data were collected. Second, more combinations of the features should be investigated, which might provide new insights about their relations to the wine quality. Finally, other models should be tested. I used a classification model, and a regression model might give better results (that is, instead of using the 3 categories I created here, treating the original ratings as a continuous variable).